Abstract: Water-filling is the term for the classic solution to 
the problem of allocating constrained power to a set of parallel 
channels to maximize the total data-rate. It is used widely in 
practice, for example, for power allocation to sub-carriers in 
multi-user OFDM systems such as WiMax. The classic water- 
filling algorithm is deterministic and requires perfect knowledge 
of the channel gain to noise ratios. In this paper we consider how 
to do power allocation over stochastically time-varying (i.i.d.) 
channels with unknown gain to noise ratio distributions. We 
adopt an online learning framework based on stochastic multi-armed
bandits. We consider two variations of the problem, one
in which the goal is to find a power allocation to maximize
$\sum_i E[\log(1 + SNR_i)]$, and another in which the goal is to find a
power allocation to maximize $\sum_i \log(1 + E[SNR_i])$. For the first
problem, we propose a cognitive water-filling algorithm that we 
call CWF1. We show that CWF1 obtains a regret (defined as the 
cumulative gap over time between the sum-rate obtained by a 
distribution-aware genie and this policy) that grows polynomially 
in the number of channels and logarithmically in time, implying 
that it asymptotically achieves the optimal time-averaged rate 
that can be obtained when the gain distributions are known. 
For the second problem, we present an algorithm called CWF2, 
which is, to our knowledge, the first algorithm in the literature on 
stochastic multi-armed bandits to exploit non-linear dependencies 
between the arms. We prove that the number of times CWF2 
picks the incorrect power allocation is bounded by a function 
that is polynomial in the number of channels and logarithmic in 
time, implying that its frequency of incorrect allocation tends to 
zero. 

I. Introduction 

A fundamental resource allocation problem that arises in 
many settings in communication networks is to allocate a 
constrained amount of power across many parallel channels 
in order to maximize the sum-rate. Assuming that the power-rate
function for each channel is proportional to log(1 + SNR),
as per Shannon's capacity theorem for AWGN channels,
it is well known that the optimal power allocation can be
determined by a water-filling strategy [1]. The classic water-filling
solution is a deterministic algorithm, and requires
perfect knowledge of all channel gain-to-noise ratios.
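
As a point of reference for the deterministic case, the following is a minimal sketch of classic water-filling with known gain-to-noise ratios and continuous power levels. The function name `water_filling`, the bisection search on the water level, and the example gains are illustrative assumptions, not taken from [1].

```python
def water_filling(gains, total_power, iters=200):
    """Maximize sum(log(1 + p_i * g_i)) subject to sum(p_i) <= total_power, p_i >= 0.

    Assumes all gains are known and strictly positive (hypothetical helper).
    """
    floors = [1.0 / g for g in gains]          # inverse gain of each channel
    lo, hi = 0.0, max(floors) + total_power    # bracket for the water level
    for _ in range(iters):                     # bisection on the water level
        level = 0.5 * (lo + hi)
        used = sum(max(level - f, 0.0) for f in floors)
        if used > total_power:
            hi = level
        else:
            lo = level
    # Each channel receives the gap between the water level and its floor.
    return [max(lo - f, 0.0) for f in floors]


powers = water_filling([2.0, 1.0, 0.1], total_power=1.0)
```

With these example gains the weakest channel lies above the water level and receives no power, which is the characteristic behavior of the solution.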

In practice, however, channel gain-to-noise ratios are 
stochastic quantities. To handle this randomness, we consider 
an alternative approach, based on online learning, specifically 
stochastic multi-armed bandits. We formulate the problem of 
stochastic water-filling as follows: time is discretized into
slots; each channel's gain-to-noise ratio is modeled as an
i.i.d. random variable with an unknown distribution. In our
general formulation, the power-to-rate function for each channel
is allowed to be any sub-additive function (footnote 1). We seek a
power allocation that maximizes the expected sum-rate (i.e.,
an optimization of the form $E[\sum_i \log(1 + SNR_i)]$). Even if
the channel gain-to-noise ratios are random variables with 
known distributions, this turns out to be a hard combinatorial 
stochastic optimization problem. Our focus in this paper is 
thus on a more challenging case. 

In the classical multi-armed bandit, there is a player playing 
K arms that yield stochastic rewards with unknown means at 
each time in i.i.d. fashion over time. The player seeks a policy 
to maximize its total expected reward over time. The performance
metric of interest in such problems is regret, defined as
the cumulative difference in expected reward between a model-aware
genie and that obtained by the given learning policy.
It is of interest to show that the regret grows sub-linearly
with time, so that the time-averaged regret asymptotically goes
to zero, implying that the time-averaged reward of the model-aware
genie is obtained asymptotically by the learning policy.

We show that it is possible to map the problem of stochastic
water-filling to an MAB formulation by treating each possible
power allocation as an arm (we consider discrete power levels
in this paper; if there are P possible power levels for each
of N channels, there would be $P^N$ total arms). We present
a novel combinatorial policy for this problem that we call
CWF1, which yields regret growing polynomially in N and
logarithmically over time. Despite the exponentially growing set
of arms, CWF1 observes and maintains information for only
$P \cdot N$ variables, one corresponding to each power level and
channel, and exploits linear dependencies between the arms
based on these variables.

In practice, the randomness in the channel gain-to-noise
ratios is typically handled by first estimating the mean channel
gain-to-noise ratios from an average over a finite
set of training observations, and then using the estimated gains
in a deterministic water-filling procedure. Essentially this
approach tries to identify the power allocation that maximizes
a pseudo-sum-rate, which is determined based on the power-rate
equation applied to the mean channel gain-to-noise ratios
(i.e., an optimization of the form $\sum_i \log(1 + E[SNR_i])$). We
also present a different stochastic water-filling algorithm that
we call CWF2, which learns to do this in an online fashion.
This algorithm observes and maintains information for N
variables, one corresponding to each channel, and exploits
non-linear dependencies between the arms based on these
variables. To our knowledge, CWF2 is the first MAB algorithm
to exploit non-linear dependencies between the arms. We
show that the number of times CWF2 plays a non-optimal
combination of powers is uniformly bounded by a function
that is logarithmic in time. Under some restrictive conditions,
CWF2 may also solve the first problem more efficiently.

1 A function f is subadditive if f(x + y) ≤ f(x) + f(y); any concave
function g with g(0) ≥ 0 (such as log(1 + x)) is subadditive.

II. Related Work 

The classic water-filling strategy is described in [1]. There
are a few other stochastic variations of water-filling that have
been covered in the literature that are different in spirit from
our formulation. When a fading distribution over the gains
is known a priori, the power constraint is expressed over
time, and the instantaneous gains are also known, then a
deterministic joint frequency-time water-filling strategy can be
used [2], [3]. In [4], a stochastic gradient approach based on
Lagrange duality is proposed to solve this problem when the
fading distribution is unknown but instantaneous gains are still
available. By contrast, in our work we do not assume that the
instantaneous gains are known, and focus on keeping the same
power constraint at each time while considering unknown gain
distributions.

Another work [5] considers water-filling over stochastic
non-stationary fading channels, and proposes an adaptive
learning algorithm that tracks the time-varying optimal power
allocation by incorporating a forgetting factor. However, the
focus of their algorithm is on minimizing the maximum
mean squared error assuming imperfect channel estimates,
and they prove only that their algorithm would converge in
a stationary setting. Although their algorithm can be viewed
as a learning mechanism, they do not treat stochastic water-filling
from the perspective of multi-armed bandits, which is
a novel contribution of our work. In our work, we focus on
a stationary setting with perfect channel estimates, but prove
stronger results, showing that our learning algorithm not only
converges to the optimal allocation, but does so with sub-linear
regret.

There has been a long line of work on stochastic multi-armed
bandits involving playing arms yielding stochastically
time-varying rewards with unknown distributions. Several
authors [6]-[9] present learning policies that yield regret
growing logarithmically over time (asymptotically, in the case
of [6]-[8], and uniformly over time in the case of [9]). Our
algorithms build on the UCB1 algorithm proposed in [9] but
make significant modifications to handle the combinatorial
nature of the arms in this problem. CWF1 has some commonalities
with the LLR algorithm we recently developed for a
completely different problem, that of stochastic combinatorial
bipartite matching for channel allocation [10], but is modified
to account for the non-linear power-rate function in this
paper. Other recent work on stochastic MAB has considered
decentralized settings [11]-[14], and non-i.i.d. reward processes
[15]-[19]. With respect to this literature, the problem
setting for stochastic water-filling is novel in that it involves
a non-linear function of the action and unknown variables. In
particular, as far as we are aware, our CWF2 policy is the
first to exploit the non-linear dependencies between arms to
provably improve the regret performance.

III. Problem Formulation 

We define the stochastic version of the classic communication
theory problem of power allocation for maximizing rate
over parallel channels (water-filling) as follows.

We consider a system with N channels, where the
channel gain-to-noise ratios are unknown random processes
$X_i(n)$, $1 \le i \le N$. Time is slotted and indexed by n.
We assume that $X_i(n)$ evolves as an i.i.d. random process
over time (i.e., we consider block fading), with the only
restriction that its distribution has a finite support. Without loss
of generality, we normalize $X_i(n) \in [0, 1]$. We do not require
that $X_i(n)$ be independent across i. This random process is
assumed to have a mean $\theta_i = E[X_i]$ that is unknown to the
users. We denote the set of all these means by $\Theta = \{\theta_i\}$.

At each decision period n (also referred to interchangeably
as a time slot), an N-dimensional action vector a(n), representing
a power allocation on these N channels, is selected
under a policy $\pi(n)$. We assume that the power levels are
discrete, and we can put any constraint on the selections
of power allocations such that they are from a finite set $\mathcal{F}$
(e.g., a maximum total power constraint, or an upper bound
on the maximum allowed power per subcarrier). We assume
$a_i(n) \ge 0$ for all $1 \le i \le N$. When a particular power
allocation a(n) is selected, the channel gain-to-noise ratios
corresponding to nonzero components of a(n) are revealed,
i.e., the value of $X_i(n)$ is observed for all i such that
$a_i(n) \ne 0$. We denote by $\mathcal{A}_{a(n)} = \{i : a_i(n) \ne 0, 1 \le i \le N\}$
the index set of all $a_i(n) \ne 0$ for an allocation a.

We adopt a general formulation for water-filling, where the
sum rate (footnote 2) obtained at time n by allocating a set of powers
a(n) is defined as:

$$R_{a(n)}(n) = \sum_{i \in \mathcal{A}_{a(n)}} f_i(a_i(n), X_i(n)), \qquad (1)$$

where for all i, $f_i(a_i(n), X_i(n))$ is a nonlinear continuous
increasing sub-additive function in $X_i(n)$, and $f_i(a_i(n), 0) = 0$
for any $a_i(n)$. We assume $f_i$ is defined on $\mathbb{R}^+ \times \mathbb{R}^+$.

Our formulation is general enough to include as a special
case the rate function obtained from Shannon's capacity
theorem for AWGN, which is widely used in communication
networks:

$$R_{a(n)}(n) = \sum_{i=1}^{N} \log(1 + a_i(n) X_i(n)).$$

In the typical formulation there is a total power constraint and
individual power constraints; the corresponding constraint set is

$$\mathcal{F} = \Big\{a : \sum_{i=1}^{N} a_i \le P_{total} \wedge 0 \le a_i \le P_i, \forall i\Big\},$$

where $P_{total}$ is the total power constraint and $P_i$ is the maximum
allowed power per channel.

Our goal is to maximize the expected sum-rate when the
distributions of all $X_i$ are unknown, as shown in (2). We refer
to this objective as $O_1$.

$$\max_{a \in \mathcal{F}} E\Big[\sum_{i \in \mathcal{A}_a} f_i(a_i, X_i)\Big] \qquad (2)$$

2 We refer to rate and reward interchangeably in this paper.

Note that even when the $X_i$ have known distributions, this is a
hard combinatorial non-linear stochastic optimization problem.
In our setting, with unknown distributions, we can formulate
this as a multi-armed bandit problem, where each power
allocation $a(n) \in \mathcal{F}$ is an arm and the reward function is
in a combinatorial non-linear form. The optimal arms are the
ones with the largest expected reward, denoted as $O^* = \{a^*\}$.
For the rest of the paper, we use * as the index indicating that
a parameter is for an optimal arm. If more than one optimal
arm exists, * refers to any one of them.

We note that for the combinatorial multi-armed bandit problem
with linear rewards, where the reward function is defined
by $R_{a(n)}(n) = \sum_{i \in \mathcal{A}_{a(n)}} a_i(n) X_i(n)$, a* is a solution to a deterministic
optimization problem because
$\max_{a \in \mathcal{F}} E[\sum_{i \in \mathcal{A}_a} a_i X_i] = \max_{a \in \mathcal{F}} \sum_{i \in \mathcal{A}_a} a_i E[X_i]$.
Different from the combinatorial multi-armed
bandit problem with linear rewards, a* here is a solution
to a stochastic optimization problem, i.e.,

$$a^* \in O^* = \Big\{\tilde{a} : \tilde{a} = \arg\max_{a \in \mathcal{F}} E\Big[\sum_{i \in \mathcal{A}_a} f_i(a_i, X_i)\Big]\Big\}. \qquad (3)$$

We evaluate policies for $O_1$ with respect to regret, which
is defined as the difference between the expected reward that
could be obtained by a genie that can pick an optimal arm
at each time, and that obtained by the given policy. Note that
minimizing the regret is equivalent to maximizing the expected
rewards. Regret can be expressed as:

$$\mathfrak{R}(n) = n R^* - E\Big[\sum_{t=1}^{n} R_{a(t)}(t)\Big], \qquad (4)$$

where a(t) is the allocation selected by the policy at time t and
$R^* = \max_{a \in \mathcal{F}} E[\sum_{i \in \mathcal{A}_a} f_i(a_i, X_i)]$ is the expected reward of
an optimal arm.

Intuitively, we would like the regret $\mathfrak{R}(n)$ to be as small
as possible. If it is sub-linear with respect to time n, the time-averaged
regret will tend to zero and the maximum possible
time-averaged reward can be achieved. Note that the number
of arms $|\mathcal{F}|$ can be exponential in the number of unknown
random variables N.
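
The regret definition can be made concrete with a small helper. This is only an illustrative sketch: `cumulative_regret` is a hypothetical name, and it assumes the expected reward of each played arm is known, which is true for the genie but not for the learner.

```python
def cumulative_regret(optimal_mean, played_means):
    """n * R^* minus the sum of expected rewards of the arms actually played.

    optimal_mean : expected reward R^* of an optimal arm
    played_means : expected reward of the arm chosen in each slot t = 1..n
    """
    n = len(played_means)
    return n * optimal_mean - sum(played_means)


# Three slots; the policy picks a sub-optimal arm (mean 0.5) once.
regret = cumulative_regret(1.0, [1.0, 0.5, 1.0])
```

A policy that picks sub-optimal arms only O(log n) times therefore has regret that is also O(log n), since each such pick costs at most the largest gap.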

We also note that for the stochastic version of the water-filling
problem, a typical way in practice to deal with the
unknown randomness is to estimate the mean channel gain-to-noise
ratios first and then find the optimized allocation based
on the mean values. This approach tries to identify the power
allocation that maximizes the power-rate equation applied to
the mean channel gain-to-noise ratios. We refer to maximizing
this as the sum-pseudo-rate over averaged channels. We denote
this objective by $O_2$, as shown in (5).

$$\max_{a \in \mathcal{F}} \sum_{i \in \mathcal{A}_a} f_i(a_i, E[X_i]) \qquad (5)$$
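
The difference between the two objectives can be illustrated on a single channel with the rate function log(1 + aX). The sketch below uses a hypothetical two-point gain distribution; the function names are illustrative. By Jensen's inequality the pseudo-rate of $O_2$ is never smaller than the expected rate of $O_1$ for a concave rate function.

```python
import math


def expected_rate(power, values, probs):
    """E[log(1 + a X)] for a discrete gain distribution (the O_1 term)."""
    return sum(p * math.log(1.0 + power * x) for x, p in zip(values, probs))


def pseudo_rate(power, values, probs):
    """log(1 + a E[X]) for the same distribution (the O_2 term)."""
    mean = sum(p * x for x, p in zip(values, probs))
    return math.log(1.0 + power * mean)


# Hypothetical channel: gain is 0.1 or 0.9 with equal probability.
values, probs = [0.1, 0.9], [0.5, 0.5]
rate_o1 = expected_rate(10.0, values, probs)
rate_o2 = pseudo_rate(10.0, values, probs)
```

Because the two quantities differ, an allocation that maximizes the sum-pseudo-rate need not maximize the expected sum-rate in general.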

We would also like to develop an online learning policy
for $O_2$. Note that the optimal arm a* of $O_2$ is a solution
to a deterministic optimization problem. So, we evaluate the
policies for $O_2$ with respect to the expected total number
of times that a non-optimal power allocation is selected. We
denote by $T_a(n)$ the number of times that a power allocation a
is picked up to time n. We denote $r_a = \sum_{i \in \mathcal{A}_a} f_i(a_i, E[X_i])$.

Let $T^{\pi}_{non}(n)$ denote the total number of times that a policy
$\pi$ selects a power allocation with $r_a < r_{a^*}$. Denote by $1^{\pi}_t(a)$ the
indicator function which is equal to 1 if a is selected under
policy $\pi$ at time t, and 0 otherwise. Then

$$E[T^{\pi}_{non}(n)] = n - E\Big[\sum_{t=1}^{n} 1^{\pi}_t(a^*)\Big] = \sum_{a : r_a < r_{a^*}} E[T_a(n)]. \qquad (6)$$

IV. Online Learning for Maximizing the Sum-Rate 

We first present in this section an online learning policy for
stochastic water-filling under objective $O_1$.

A. Policy Design

A straightforward, naive way to solve this problem is to
use the UCB1 policy proposed in [9]. For UCB1, each power
allocation is treated as an arm, and the arm that maximizes
$\bar{Y}_k + \sqrt{\frac{2 \ln n}{m_k}}$ will be selected at each time slot, where $\bar{Y}_k$ is
the mean observed reward on arm k, and $m_k$ is the number of
times that arm k has been played. This approach essentially
ignores the underlying dependencies across the different arms,
requires storage that is linear in the number of arms,
and yields regret growing linearly with the number of arms.
Since there can be an exponential number of arms, the UCB1
algorithm performs poorly on this problem.
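
For concreteness, the UCB1 index of [9] (sample mean plus the exploration bonus sqrt(2 ln n / m_k)) can be sketched as follows. The function names are illustrative, and the sketch assumes every arm has already been played at least once.

```python
import math


def ucb1_index(mean_reward, plays, n):
    """UCB1 index of one arm: sample mean plus exploration bonus."""
    return mean_reward + math.sqrt(2.0 * math.log(n) / plays)


def ucb1_select(means, plays, n):
    """Index of the arm with the largest UCB1 index at time n."""
    return max(range(len(means)),
               key=lambda k: ucb1_index(means[k], plays[k], n))


# Two arms with equal sample means: the less-played arm gets the larger bonus.
chosen = ucb1_select([0.5, 0.5], [10, 1], n=11)
```

Applied to water-filling, `means` and `plays` would need one entry per power allocation, which is the storage cost the text refers to.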

We note that for combinatorial optimization problems with
linear reward functions, an online learning algorithm LLR
has been proposed in [6] as an efficient solution. LLR stores
the mean of observed values for every underlying unknown
random variable, as well as the number of times each has
been observed. So the storage of LLR is linear in the number
of unknown random variables, and the analysis in [6] shows
LLR achieves a regret that grows logarithmically in time, and
polynomially in the number of unknown parameters.

However, for stochastic water-filling with
objective $O_1$, where the expectation is outside the non-linear
reward function, directly storing the mean observations of $X_i$
will not work.

To deal with this challenge, we propose to store the information
for each $(a_i, X_i)$ combination, i.e., $\forall 1 \le i \le N$, $\forall a_i$, we
define a new set of random variables $Y_{i,a_i} = f_i(a_i, X_i)$. So
now the number of random variables $Y_{i,a_i}$ is $\sum_{i=1}^{N} |\mathcal{B}_i|$, where
$\mathcal{B}_i = \{a_i : a_i \ne 0\}$. Note that $\sum_{i=1}^{N} |\mathcal{B}_i| \le PN$.

Then the reward function can be expressed as

$$R_a = \sum_{i \in \mathcal{A}_a} Y_{i,a_i}. \qquad (7)$$

Note that (7) is in a combinatorial linear form.

For this redefined MAB problem with $\sum_{i=1}^{N} |\mathcal{B}_i|$ unknown
random variables and linear reward function (7), we propose
the following online learning policy CWF1 for stochastic
water-filling, as shown in Algorithm 1.



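Algorithm 1 Online Learning for Stochastic Water-Filling: CWF1

 1: // Initialization
 2: If $\max_a |\mathcal{A}_a|$ is known, let $L = \max_a |\mathcal{A}_a|$; else, L = N;
 3: for n = 1 to N do
 4:     Play any arm a such that $n \in \mathcal{A}_a$;
 5:     $\forall i \in \mathcal{A}_a$, $\forall a_i \in \mathcal{B}_i$, $\bar{Y}_{i,a_i} := \frac{\bar{Y}_{i,a_i} m_i + f_i(a_i, X_i)}{m_i + 1}$;
 6:     $\forall i \in \mathcal{A}_a$, $m_i := m_i + 1$;
 7: end for
 8: // Main loop
 9: while 1 do
10:     n := n + 1;
11:     Play an arm a which solves the maximization problem

            $$\max_{a \in \mathcal{F}} \sum_{i \in \mathcal{A}_a} \Big( \bar{Y}_{i,a_i} + \sqrt{\frac{(L+1) \ln n}{m_i}} \Big); \qquad (8)$$

12:     $\forall i \in \mathcal{A}_a$, $\forall a_i \in \mathcal{B}_i$, $\bar{Y}_{i,a_i} := \frac{\bar{Y}_{i,a_i} m_i + f_i(a_i, X_i)}{m_i + 1}$;
13:     $\forall i \in \mathcal{A}_a$, $m_i := m_i + 1$;
14: end while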

To obtain a tighter bound on the regret than the LLR
algorithm, instead of storing the number of times that each
unknown random variable $Y_{i,a_i}$ has been observed, we use a
1 by N vector, denoted as $(m_i)_{1 \times N}$, to store the number of
times that $X_i$ has been observed up to the current time slot.

We use a 1 by $\sum_{i=1}^{N} |\mathcal{B}_i|$ vector, denoted as $(\bar{Y}_{i,a_i})_{1 \times \sum_{i=1}^{N} |\mathcal{B}_i|}$,
to store the information based on the observed values.
This vector is updated as shown in line 12. Each time
an arm a(n) is played, $\forall i \in \mathcal{A}_{a(n)}$, the observed value of $X_i$
is obtained. For every observed value of $X_i$, $|\mathcal{B}_i|$ values are
updated: $\forall a_i \in \mathcal{B}_i$, the average value $\bar{Y}_{i,a_i}$ of all the values
of $Y_{i,a_i}$ up to the current time slot is updated. The CWF1 policy
requires storage linear in $\sum_{i=1}^{N} |\mathcal{B}_i|$.
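
The CWF1 policy can be sketched in a few lines for a small problem. This is a minimal illustration under stated assumptions, not the authors' implementation: the maximization over the feasible set is done by brute-force enumeration, the feasible set is a total-power constraint over discrete levels, and the names `cwf1`, `sample_gain` and `rate` are hypothetical.

```python
import itertools
import math
import random


def cwf1(levels, total_power, sample_gain, rate, horizon, seed=0):
    """Sketch of CWF1: one running mean per (channel, power level), one counter per channel."""
    rng = random.Random(seed)
    n_ch = len(levels)
    arms = [a for a in itertools.product(*levels) if sum(a) <= total_power]
    L = max(sum(1 for p in a if p > 0) for a in arms)
    y_bar = [{p: 0.0 for p in lv if p > 0} for lv in levels]  # mean of f_i(a_i, X_i)
    m = [0] * n_ch                                            # observations of X_i

    def play(arm):
        for i, p in enumerate(arm):
            if p > 0:                       # gain revealed only on powered channels
                x = sample_gain(i, rng)
                for q in y_bar[i]:          # one observation updates every power level
                    y_bar[i][q] = (y_bar[i][q] * m[i] + rate(q, x)) / (m[i] + 1)
                m[i] += 1

    for i in range(n_ch):                   # initialization: observe each channel once
        play(next(a for a in arms if a[i] > 0))
    for n in range(n_ch + 1, horizon + 1):  # main loop: maximize the index sum
        def index(a):
            return sum(y_bar[i][p] + math.sqrt((L + 1) * math.log(n) / m[i])
                       for i, p in enumerate(a) if p > 0)
        play(max(arms, key=index))
    return y_bar, m


def log_rate(p, x):
    return math.log(1.0 + p * x)


def const_gain(i, rng):
    return (0.9, 0.1)[i]                    # deterministic gains, for a reproducible check


y_bar, m = cwf1([(0, 1, 2), (0, 1, 2)], 2, const_gain, log_rate, horizon=50)
```

Note how a single observation of $X_i$ updates the running mean for every power level of channel i, which is why only one counter per channel is needed.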



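B. Analysis of regret

Theorem 1: The expected regret under the CWF1 policy is
at most

$$\Big[\frac{\cdots\,(L+1)\,N \ln n}{(\Delta_{\min})^2} + N + \cdots\Big] \cdots \qquad (9)$$

where $a_{\max} = \max_{a \in \mathcal{F}} \max_i a_i$, $\Delta_{\min} = \min_{a \ne a^*} (R^* - E[R_a])$, and
$\Delta_{\max} = \max_a (R^* - E[R_a])$. Note that $L \le N$.

The proof of Theorem 1 is omitted.
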

Remark 1: For the CWF1 policy, although there are $\sum_{i=1}^{N} |\mathcal{B}_i|$
random variables, the upper bound of regret remains
$O(N^4 \log n)$, which is the same as LLR, as shown by Theorem
2 in [6]. Directly applying the LLR algorithm to solve the
redefined MAB problem in (7) will result in a regret that grows
as $O(P^4 N^4 \log n)$.

Remark 2: Algorithm 1 will even work for rate functions
that do not satisfy subadditivity.

Remark 3: We can develop similar policies and results
when $X_i$ are Markovian rewards, as in [19] and [20].

V. Online Learning for Sum-Pseudo-Rate 

We now show our novel online learning algorithm CWF2 for
stochastic water-filling with objective $O_2$. Unlike CWF1, CWF2
exploits non-linear dependencies between the choices of power
allocations and requires lower storage. Under the condition that
the power allocation that maximizes $O_2$ also maximizes $O_1$,
we will see through simulations that CWF2 has better regret
performance.

A. Policy Design

Our proposed policy CWF2 for stochastic water-filling with
objective $O_2$ is shown in Algorithm 2.

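Algorithm 2 Online Learning for Stochastic Water-Filling: CWF2

 1: // Initialization
 2: If $\max_a |\mathcal{A}_a|$ is known, let $L = \max_a |\mathcal{A}_a|$; else, L = N;
 3: for n = 1 to N do
 4:     Play any arm a such that $n \in \mathcal{A}_a$;
 5:     $\forall i \in \mathcal{A}_a$, $\bar{X}_i := \frac{\bar{X}_i m_i + X_i}{m_i + 1}$, $m_i := m_i + 1$;
 6: end for
 7: // Main loop
 8: while 1 do
 9:     n := n + 1;
10:     Play an arm a which solves the maximization problem

            $$\max_{a \in \mathcal{F}} \sum_{i \in \mathcal{A}_a} \Big( f_i(a_i, \bar{X}_i) + f_i\Big(a_i, \sqrt{\frac{(L+1) \ln n}{m_i}}\Big) \Big); \qquad (10)$$

11:     $\forall i \in \mathcal{A}_{a(n)}$, $\bar{X}_i := \frac{\bar{X}_i m_i + X_i}{m_i + 1}$, $m_i := m_i + 1$;
12: end while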

We use two 1 by N vectors to store the information after
we play an arm at each time slot. One is $(\bar{X}_i)_{1 \times N}$, in which
$\bar{X}_i$ is the average (sample mean) of all the observed values
of $X_i$ up to the current time slot (obtained through potentially
different sets of arms over time). The other one is $(m_i)_{1 \times N}$, in
which $m_i$ is the number of times that $X_i$ has been observed
up to the current time slot. So the CWF2 policy requires storage
linear in N.
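
A compact sketch of CWF2 along the same lines follows. It is an illustration under assumptions rather than the authors' code: the feasible set is enumerated by brute force under a total-power constraint, and the names `cwf2`, `sample_gain` and `rate` are hypothetical. The only state kept is one sample mean and one counter per channel.

```python
import itertools
import math
import random
from collections import Counter


def cwf2(levels, total_power, sample_gain, rate, horizon, seed=0):
    """Sketch of CWF2: one sample mean and one counter per channel."""
    rng = random.Random(seed)
    n_ch = len(levels)
    arms = [a for a in itertools.product(*levels) if sum(a) <= total_power]
    L = max(sum(1 for p in a if p > 0) for a in arms)
    x_bar = [0.0] * n_ch    # sample mean of X_i
    m = [0] * n_ch          # number of observations of X_i
    plays = Counter()       # how often each allocation was selected

    def play(arm):
        plays[arm] += 1
        for i, p in enumerate(arm):
            if p > 0:       # gain revealed only on powered channels
                x_bar[i] = (x_bar[i] * m[i] + sample_gain(i, rng)) / (m[i] + 1)
                m[i] += 1

    for i in range(n_ch):                   # initialization
        play(next(a for a in arms if a[i] > 0))
    for n in range(n_ch + 1, horizon + 1):  # main loop
        def index(a):
            # Rate at the sample mean plus rate at the confidence width.
            return sum(rate(p, x_bar[i])
                       + rate(p, math.sqrt((L + 1) * math.log(n) / m[i]))
                       for i, p in enumerate(a) if p > 0)
        play(max(arms, key=index))
    return x_bar, m, plays


def log_rate(p, x):
    return math.log(1.0 + p * x)


def const_gain(i, rng):
    return (0.9, 0.1)[i]    # deterministic gains, for a reproducible check


x_bar, m, plays = cwf2([(0, 1, 2), (0, 1, 2)], 2, const_gain, log_rate, horizon=50)
```

The exploration term is passed through the rate function itself, so the bonus of an allocation depends non-linearly on its power levels; this is the sense in which the policy uses non-linear dependencies between arms.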

B. Analysis of regret 

For the analysis of the upper bound for $E[T^{\pi}_{non}(n)]$ of the
CWF2 policy, we use the inequalities stated in the Chernoff-Hoeffding
bound as follows:

Lemma 1 (Chernoff-Hoeffding bound [21]):
$X_1, \ldots, X_n$ are random variables with range [0, 1], and
$E[X_t | X_1, \ldots, X_{t-1}] = \mu$, $\forall 1 \le t \le n$. Denote $S_n = \sum X_t$.
Then for all $a \ge 0$,

$$P\{S_n \ge n\mu + a\} \le e^{-2a^2/n}, \qquad P\{S_n \le n\mu - a\} \le e^{-2a^2/n}. \qquad (11)$$

Theorem 2: Under the CWF2 policy, the expected total
number of times that non-optimal power allocations are selected
is at most

$$\frac{N(L+1)\ln n}{B_{\min}^2} + N + \frac{\pi^2}{3} L N, \qquad (12)$$

where $B_{\min}$ is a constant defined by $\delta_{\min}$ and L, and
$\delta_{\min} = \min_{a : r_a < r^*} (r^* - r_a)$.

Proof:

We will show the upper bound of the regret in three steps:
(1) introduce a counter $\tilde{T}_i(n)$ (defined below) and show
its relationship with the upper bound of the regret; (2) show
the upper bound of $E[\tilde{T}_i(n)]$; (3) show the upper bound of
$E[T^{\pi}_{non}(n)]$.

(1) The counter $\tilde{T}_i(n)$

After the initialization period, $(\tilde{T}_i(n))_{1 \times N}$ is introduced
as a counter and is updated in the following way: at any
time n when a non-optimal power allocation is selected, find
i such that $i = \arg\min_{j \in \mathcal{A}_{a(n)}} m_j$. If there is only one such
index, $\tilde{T}_i(n)$ is increased by 1. If there are multiple
such indices, we arbitrarily pick one, say i', and
increment $\tilde{T}_{i'}$ by 1. Based on the above definition of $\tilde{T}_i(n)$,
each time a non-optimal power allocation is selected,
exactly one element in $(\tilde{T}_i(n))_{1 \times N}$ is incremented by 1. So
the summation of all counters in $(\tilde{T}_i(n))_{1 \times N}$ equals the
total number of times that we have selected non-optimal power
allocations, as below:

$$\sum_{a : r_a < r^*} E[T_a(n)] = \sum_{i=1}^{N} E[\tilde{T}_i(n)]. \qquad (13)$$

We also have the following inequality for $\tilde{T}_i(n)$:

$$\tilde{T}_i(n) \le m_i(n), \quad \forall 1 \le i \le N. \qquad (14)$$
(2) Upper bound of $E[\tilde{T}_i(n)]$

Let $C_{t,m_i}$ denote $\sqrt{\frac{(L+1)\ln t}{m_i}}$. Denote by $\tilde{I}_i(n)$ the indicator
function which is equal to 1 if $\tilde{T}_i(n)$ is increased by one at time
n. Let l be an arbitrary positive integer. Then, we could get
the upper bound of $E[\tilde{T}_i(n)]$ as shown in (15), where a(t)
is defined as a non-optimal power allocation picked at time
t when $\tilde{I}_i(t) = 1$. Note that $m_i = \min_j \{m_j : \forall j \in \mathcal{A}_{a(t)}\}$.
We denote this power allocation by a(t) since at each time
that $\tilde{I}_i(t) = 1$, we could get different selections of power
allocations.

Note that $l \le \tilde{T}_i(t-1)$ implies $l \le \tilde{T}_i(t-1) \le m_j(t-1)$,
$\forall j \in \mathcal{A}_{a(t)}$. So we could get an upper bound
of $E[\tilde{T}_i(n)]$ as shown in (16), (17), (18), (19), where $h_j$
($1 \le j \le |\mathcal{A}_{a^*}|$) represents the j-th element in $\mathcal{A}_{a^*}$;
$p_j$ ($1 \le j \le |\mathcal{A}_{a(t)}|$) represents the j-th element in $\mathcal{A}_{a(t)}$;

$$r^* = \sum_{j=1}^{|\mathcal{A}_{a^*}|} f_{h_j}(a^*_{h_j}, \theta_{h_j}) = \sum_{i \in \mathcal{A}_{a^*}} f_i(a^*_i, \theta_i); \qquad r_{a(t)} = \sum_{j=1}^{|\mathcal{A}_{a(t)}|} f_{p_j}(a_{p_j}(t), \theta_{p_j}) = \sum_{i \in \mathcal{A}_{a(t)}} f_i(a_i(t), \theta_i).$$

Now we show the upper bound of the probabilities for
inequalities (17), (18) and (19) separately. We first find the
upper bound of the probability for (17), as shown in (21).

Equation (20) holds because each $f_{h_j}$ is sub-additive, so $\forall j$,

$$f_{h_j}(a^*_{h_j}, \bar{X}_{h_j,m_{h_j}} + C_{t-1,m_{h_j}}) \le f_{h_j}(a^*_{h_j}, \bar{X}_{h_j,m_{h_j}}) + f_{h_j}(a^*_{h_j}, C_{t-1,m_{h_j}}). \qquad (22)$$

(21) holds because $\forall i$, $f_i(a_i, x_i)$ is a non-decreasing function
in $x_i$ for any $x_i \ge 0$.

In (21), $\forall 1 \le j \le |\mathcal{A}_{a^*}|$, applying the Chernoff-Hoeffding
bound stated in Lemma 1, we could find the upper bound of
each item as

$$P\{\bar{X}_{h_j,m_{h_j}} + C_{t-1,m_{h_j}} \le \theta_{h_j}\} \le e^{-2(L+1)\ln(t-1)} = (t-1)^{-2(L+1)}.$$

Thus,

$$P\Big\{\sum_{j=1}^{|\mathcal{A}_{a^*}|} f_{h_j}(a^*_{h_j}, \bar{X}_{h_j,m_{h_j}}) \le r^* - \sum_{j=1}^{|\mathcal{A}_{a^*}|} f_{h_j}(a^*_{h_j}, C_{t-1,m_{h_j}})\Big\} \le |\mathcal{A}_{a^*}| (t-1)^{-2(L+1)} \le L (t-1)^{-2(L+1)}. \qquad (23)$$

Now we can get the upper bound of the probability for
inequality (18), as shown in (24).

Equation (24) holds, following a similar reasoning as used
to derive (23).

For all i and given any $a_i$, since $f_i(a_i, x)$ is an increasing,
continuous function in x, we could find a constant $B_i(a_i)$ such
that

$$f_i(a_i, B_i(a_i)) = \frac{\delta_{\min}}{2L}. \qquad (25)$$

Denote $B_{\min}(a) = \min_{i \in \mathcal{A}_a} B_i(a_i)$. Then $\forall i \in \mathcal{A}_a$, we have

$$f_i(a_i, B_{\min}(a)) \le \frac{\delta_{\min}}{2L}. \qquad (26)$$



These equations are on the next page due to the space limitations. 
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E[Ti(n)]= J2 PUi(*) = !}<«+ E PW) = l,2i(t-l)>0 

t=iV+l t=iV+l 
n 

<'+ E P { E (/iK-.^,m,(t-l)) + / J -(a j *,Ct-l,m,(t-l))) (15) 

t=N+i jeA^ 

< E (/i(ai(*).^, m i(t-i)) + Ct-Wt-i))) - 1) > /}. 



I -4a 



E[T,(n)] <Z + E P C min E (X K > ) + « ' ^-i.-^ )) 

t =N+l U<mh l'---' m ' 1 |^a»l <t j = l ^ ' 

< max E (/pi( a Pi(*)'^.m»J+/pi( o Pi(*)' c *-i."»pj)} 

- P1 ' P l^a(t)l j = l 

OO t-1 t-1 t-1 t-1 |-4a* 

< Z +E E ••• E E ••• E p {E (hM^^+hM^t-^i) 

t=2 m hl =l m h\ A *i =1 m P1 =i m P|yi (t) | =i 3 = 1 

l-4a(t) 

< E (/ P3 KW^ K ,™ Pj ) + / ft KW.^i, mpj ))} (16) 

oo t-1 t-1 t-1 t-1 

< i + E E ' ' ' E E ' ' ' E P{At least one of the following must hold: 

t=2m hl =l m '>|^»| =1 m P1 =l m P\ A t | =i 

i -4a* I |-4a* 

E fh 3 {a*h^ X h 3 , mhj ) < r* - E .fh^al^Ct-i.m^), (17) 

-4 a (t)| |-4a(t)| 

E /w^W.Ct-i,^). (18) 
j=i j=i 

l-4a(t)| 

r* < r a(t) + 2 E /« K (*)> C-t-i,^ )} (19) 



| -4-a* | l-4-a* | 

p {E /^K.^,m h .)<r*- E 

I -4-a* | |-^a* | 

= p {E {fhMh p X hj , mhj ) + f hj {a* hj ,C t _ hmhj )) < E hMh^hi)} 

l-4-a* | 

< E K - ) + h 3 K 3 . , c t - ltmhj ) < hj « . ,e hj )} 

|»Aa* | 

< ^ P{ /„, (a^ , ^ + C t - hmhj ) < f hj (a* h] ,6 hj )} (20) 
i=i 

|-4 a ,| 

= E 1P{^^. + Ct-l,m hj < hj ) (21) 
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3=1 3=1 

= p {E feK-W.^,™,-)^ E (/wKW.^l + feKW.C'-i.^))} 

3=1 3=1 
l-4 a (t)l 

< E P ^ W.^,^ ) > f Pj (a Pj (t),6 Pi ) + f Pj (a P] (t), C t _!, mp . )} 

3=1 

l-A»«l 

< E P {/« K W> X W^ Pj ) ^ /ft K (*)' ft + °t-hm Pj )} 

3=1 

|A*(t)| 

= E p {^ f 

3=1 



(24) 



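Note that for $l \ge \Big\lceil \frac{(L+1)\ln n}{B_{\min}^2(a(t))} \Big\rceil$,

$$\begin{aligned}
r^* - r_{a(t)} &- 2 \sum_{j=1}^{|\mathcal{A}_{a(t)}|} f_{p_j}(a_{p_j}(t), C_{t-1,m_{p_j}}) \\
&= r^* - r_{a(t)} - 2 \sum_{j=1}^{|\mathcal{A}_{a(t)}|} f_{p_j}\Big(a_{p_j}(t), \sqrt{\frac{(L+1)\ln(t-1)}{m_{p_j}}}\Big) \\
&\ge r^* - r_{a(t)} - 2 \sum_{j=1}^{|\mathcal{A}_{a(t)}|} f_{p_j}\Big(a_{p_j}(t), \sqrt{\frac{(L+1)\ln n}{l}}\Big) \\
&\ge r^* - r_{a(t)} - 2 \sum_{j=1}^{|\mathcal{A}_{a(t)}|} f_{p_j}(a_{p_j}(t), B_{\min}(a(t))) \\
&\ge \delta_{a(t)} - 2 \sum_{j=1}^{|\mathcal{A}_{a(t)}|} \frac{\delta_{\min}}{2L} \ge \delta_{a(t)} - \delta_{\min} \ge 0, \qquad (27)
\end{aligned}$$

where $\delta_{a(t)} = r^* - r_{a(t)}$.

So (19) is false when $l \ge \Big\lceil \frac{(L+1)\ln n}{B_{\min}^2(a(t))} \Big\rceil$.
We denote $B_{\min} = \min_{a \in \mathcal{F}} B_{\min}(a)$, and let $l \ge \Big\lceil \frac{(L+1)\ln n}{B_{\min}^2} \Big\rceil$;
then (19) is false for all a(t).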

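Therefore, we get the upper bound of $E[\tilde{T}_i(n)]$ as in (28).

(3) Upper bound of $E[T^{\pi}_{non}(n)]$

$$E[T^{\pi}_{non}(n)] = \sum_{a : r_a < r^*} E[T_a(n)] = \sum_{i=1}^{N} E[\tilde{T}_i(n)] \le \frac{N(L+1)\ln n}{B_{\min}^2} + N + \frac{\pi^2}{3} L N. \qquad (29)$$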
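
To get a feel for the size of this bound, the sketch below evaluates it for the rate function log(1 + ax). It assumes the reading $f_i(a_i, B_i(a_i)) = \delta_{\min}/(2L)$ for (25), which for this rate function gives $B_i(a_i) = (e^{\delta_{\min}/(2L)} - 1)/a_i$. The value of `delta_min`, the power levels and the function names are hypothetical.

```python
import math


def b_min_log_rate(levels, L, delta_min):
    """Smallest B_i(a_i) over all nonzero power levels, for f(a, x) = log(1 + a x)."""
    target = delta_min / (2.0 * L)
    return min((math.exp(target) - 1.0) / p
               for lv in levels for p in lv if p > 0)


def cwf2_bound(n, num_channels, L, b_min):
    """N (L + 1) ln n / B_min^2 + N + (pi^2 / 3) L N."""
    return (num_channels * (L + 1) * math.log(n) / b_min ** 2
            + num_channels + math.pi ** 2 / 3.0 * L * num_channels)


levels = [(0, 10, 20, 30), (0, 10, 20, 30), (0, 10, 20, 30, 40), (0, 10, 20)]
b_min = b_min_log_rate(levels, L=4, delta_min=0.05)   # delta_min is an assumed gap
bound = cwf2_bound(10 ** 6, num_channels=4, L=4, b_min=b_min)
```

For small gaps the constant $1/B_{\min}^2$ is very large, so the bound can exceed the horizon n itself; it is informative mainly about the logarithmic growth rate.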



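Remark 4: CWF2 can be used to solve the stochastic water-filling
with objective $O_1$ as well if $\exists a^* \in O^*$, such that $\forall a \notin O^*$,

$$\sum_{i \in \mathcal{A}_{a^*}} f_i(a^*_i, \theta_i) > \sum_{i \in \mathcal{A}_a} f_i(a_i, \theta_i). \qquad (30)$$

Then the regret of CWF2 is at most

$$\mathfrak{R}^{CWF2}(n) \le \Big[\frac{N(L+1)\ln n}{B_{\min}^2} + N + \frac{\pi^2}{3} L N\Big] \Delta_{\max}. \qquad (31)$$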



VI. Applications and Numerical Simulation 
Results 

A. Numerical Results for CWF1 

We now show the numerical results for the CWF1 policy. We
consider an OFDM system with 4 subcarriers. We assume the
bandwidth of the system is 4 MHz, and the noise density is
-80 dBW/Hz. We assume Rayleigh fading with parameter $\sigma$ =
(2, 0.8, 2.80, 0.32) for the 4 subcarriers. We consider the following
objective for our simulation:

$$\max E\Big[\sum_{i=1}^{N} \log(1 + a_i(n) X_i(n))\Big] \qquad (32)$$

s.t.

$$\sum_{i=1}^{N} a_i(n) \le P_{total}, \forall n \qquad (33)$$
$$a_1(n) \in \{0, 10, 20, 30\}, \forall n \qquad (34)$$
$$a_2(n) \in \{0, 10, 20, 30\}, \forall n \qquad (35)$$
$$a_3(n) \in \{0, 10, 20, 30, 40\}, \forall n \qquad (36)$$
$$a_4(n) \in \{0, 10, 20\}, \forall n \qquad (37)$$

where $P_{total}$ = 60 mW (17.8 dBm). The unit for the power
constraints (34) to (37) is mW. Note that (33) to (37)
define the constraint set $\mathcal{F}$.

For this scenario, there are 140 different choices of power
allocations, and the optimal power allocation can be calculated
as (20, 20, 20, 0).
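
The size of the constraint set can be checked by direct enumeration. The short sketch below lists every combination of the per-subcarrier power levels in (34) to (37) and keeps those that satisfy the total power constraint (33); the variable names are illustrative.

```python
import itertools

# Per-subcarrier power levels in mW, from constraints (34) to (37).
levels = [
    (0, 10, 20, 30),
    (0, 10, 20, 30),
    (0, 10, 20, 30, 40),
    (0, 10, 20),
]
P_TOTAL = 60  # total power budget in mW, constraint (33)

all_combinations = list(itertools.product(*levels))
feasible = [a for a in all_combinations if sum(a) <= P_TOTAL]
```

Of the 4 x 4 x 5 x 3 = 240 combinations, 140 satisfy the budget, so a per-arm policy such as UCB1 tracks 140 arms, while CWF1 tracks 3 + 3 + 4 + 2 = 12 running means and CWF2 tracks 4.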



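$$\begin{aligned}
E[\tilde{T}_i(n)] &\le \Big\lceil \frac{(L+1)\ln n}{B_{\min}^2} \Big\rceil + \sum_{t=2}^{\infty} \Big( \sum_{m_{h_1}=1}^{t-1} \cdots \sum_{m_{h_{|\mathcal{A}_{a^*}|}}=1}^{t-1} \sum_{m_{p_1}=l}^{t-1} \cdots \sum_{m_{p_{|\mathcal{A}_{a(t)}|}}=l}^{t-1} 2L (t-1)^{-2(L+1)} \Big) \\
&\le \frac{(L+1)\ln n}{B_{\min}^2} + 1 + L \sum_{t=1}^{\infty} 2 t^{-2} \le \frac{(L+1)\ln n}{B_{\min}^2} + 1 + \frac{\pi^2}{3} L. \qquad (28)
\end{aligned}$$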




We compare the performance of our proposed CWF1 policy
with the UCB1 policy and the LLR policy, as shown in Figure 1. As
we can see from Figure 1, naively applying the UCB1 and LLR policies
results in worse performance than CWF1, since the UCB1
policy cannot exploit the underlying dependencies across
arms, and the LLR policy does not utilize the observations as
efficiently as CWF1 does.

B. Numerical Results for CWF2 

We show the simulation results of CWF2 using the same
system as in VI-A.

We consider the following objective for our simulation:

$$\max_{a \in \mathcal{F}} \sum_{i=1}^{N} \log(1 + a_i(n) E[X_i(n)]) \qquad (38)$$

where $\mathcal{F}$ is the same as in VI-A.

For this scenario, we assume Rayleigh fading with parameter
$\sigma$ = (1.23, 1.0, 0.55, 0.95) for the 4 subcarriers. The
optimal power allocation can be calculated as (20, 20, 0, 20).

Figure 2 shows the simulation results of the total number
of times that non-optimal power allocations are chosen by
running CWF2 up to 30 million time slots. We also show the
theoretical upper bound in Figure 2. In this case, we see that
the theoretical upper bound is quite loose and the algorithm
does much better in practice.

For this setting, we note that (30) is satisfied, since
(20, 20, 0, 20) also maximizes (32). So, as stated in Remark
4, CWF2 can also be used to solve stochastic water-filling
with $O_1$, with regret that grows logarithmically in time and
polynomially in the number of channels.

We show a comparison of the UCB1 policy, LLR policy,
CWF1 policy and CWF2 policy under this setting in Figure
3. We can see that CWF2 performs the best by far, since it
incorporates a way to exploit non-linear dependencies across
arms and learns more efficiently.

Fig. 3. Normalized regret (regret divided by log n) vs. n time slots, for UCB1, LLR, CWF1 and CWF2.

VII. Conclusion 

We have considered the problem of optimal power allocation 
over parallel channels with stochastically time-varying gain- 
to-noise ratios for maximizing information rate (stochastic 
water-filling) in this work. We approached this problem from 
the novel perspective of online learning. The crux of our
approach is to map each possible power allocation to an arm in
a stochastic multi-armed bandit problem. The significant new 
challenge imposed here is that the reward obtained is a non- 
linear function of the arm choice and the underlying unknown 
random variables. To our knowledge there is no prior work on 
stochastic MAB that explicitly treats such a problem. 

We first considered the problem of maximizing the expected 
sum rate. For this problem we developed the CWF1 algorithm. 
Despite the fact that the number of arms grows exponentially 
in the number of possible channels, we show that the CWF1 
algorithm requires only polynomial storage and also yields a 
regret that is polynomial in the number of power levels per 
channel and the number of channels, and logarithmic in time. 

We then considered the problem of maximizing the sum- 
pseudo-rate, where the pseudo rate for a stochastic channel 
is defined by applying the power-rate equation to its mean 
SNR (log(1 + E[SNR])). The justification for considering this
problem is its connection to practice (where allocations over 
stochastic channels are made based on estimated mean channel 
conditions). Albeit sub-optimal with respect to maximizing 
the expected sum-rate, the use of the sum-pseudo-rate as 
the objective function is a more tractable approach. For this 
problem, we developed a new MAB algorithm that we call 
CWF2. This is the first algorithm in the literature on stochastic 
MAB that exploits non-linear dependencies between the arm 
rewards. We have proved that the number of times this policy 
uses a non-optimal power allocation is also bounded by a 
function that is polynomial in the number of channels and 
power-levels, and logarithmic in time. 

Our simulation results show that the algorithms we develop
are indeed better than naive application of classic MAB 
solutions. We also see that under settings where the power 
allocation for maximizing the sum-pseudo-rate matches the 
optimal power allocation that maximizes the expected sum- 
rate, CWF2 has significantly better regret-performance than 
CWF1. 

Because our formulations allow for very general classes of 
sub-additive reward functions, we believe that our technique 
may be much more broadly applicable to settings other than 
power allocation for stochastic channels. We would therefore 
like to identify and explore such applications in future work.